feat: integrate FinWorldJudge with OpenJudge support & add project blogs #21
TaoShuchang wants to merge 9 commits into main
…on OpenJudge
- Refactored `reward_metric_helper`, optimizing the data structure and statistical logic of OpenJudge and Finance Evaluator
- Added the `DeepFinanceJudgeByOpenJudge` class to provide unified calls and weighted fusion across multiple Graders
- Supports both RM Gallery and Finance Evaluator as evaluation sources, broadening the evaluation dimensions
- Calls the OpenJudge Runner asynchronously, with retry and error handling mechanisms
- Caches loaded reference answers, improving RM Gallery evaluation efficiency
- Added tool call penalty calculation, fusing `step_reward` with the scores from each Grader
- Added automatic saving of debug information when the OpenJudge scores for every Grader are zero
- Logging and timing statistics cover the entire evaluation process, facilitating performance monitoring and troubleshooting
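The weighted fusion and tool call penalty described in this commit can be pictured with a minimal Python sketch. The function name `fuse_scores`, the grader names, the normalization by total weight, and the way the penalty is subtracted are all assumptions made for illustration; the PR does not show how `DeepFinanceJudgeByOpenJudge` actually combines these values.

```python
from typing import Dict


def fuse_scores(
    grader_scores: Dict[str, float],
    weights: Dict[str, float],
    step_reward: float = 0.0,
    tool_penalty: float = 0.0,
) -> float:
    """Hypothetical fusion: weighted mean of grader scores, plus step_reward, minus penalty."""
    # Keep only graders that have a positive weight and actually produced a score.
    active = {name: w for name, w in weights.items() if w > 0 and name in grader_scores}
    total_weight = sum(active.values())
    if total_weight == 0:
        fused = 0.0  # Assumption: no active grader contributes a zero base score.
    else:
        fused = sum(grader_scores[name] * w for name, w in active.items()) / total_weight
    return fused + step_reward - tool_penalty


# Illustrative values only; "audit" has weight 0.0 and is therefore ignored.
example = fuse_scores(
    grader_scores={"rm": 0.8, "grounding": 0.4, "audit": 0.9},
    weights={"rm": 0.75, "grounding": 0.25, "audit": 0.0},
    step_reward=0.05,
    tool_penalty=0.1,
)
```

Normalizing by the sum of active weights keeps the fused score in the same range as the individual grader scores even when some weights are set to zero, which is one plausible reason to disable graders through their weight rather than by removing them.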
…dependent Model Configuration
- Added a new OpenJudge-based `FinanceCompositionEvaluator` to replace the legacy implementation.
- Implemented domain-based routing to direct requests to the appropriate set of graders, supporting multiple fields such as stock analysis and industry research.
- Implemented an asynchronous pairwise evaluation interface that returns scores within the 0–1 range.
- Enabled independent configuration for `finance_llm`; if not explicitly configured, the general `openjudge_llm` model is reused.
- Cleaned up redundant imports and deprecated code within `DeepFinanceJudgeByOpenJudge`.
- Updated `deep_finance_openjudge_template.yaml` to include documentation for the `finance_llm` option.
- Refined the description of "evidence traceability" in `deep_finance.md`, renaming it to "Reference Logic Audit" and enhancing the details regarding the workflow and judgment criteria.
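The `finance_llm` fallback described in this commit could look roughly like the following Python sketch. The helper name `resolve_finance_llm` and the plain-dict config are hypothetical; only the key names `finance_llm` and `openjudge_llm` and the fallback rule come from the commit message.

```python
from typing import Any, Mapping, Optional


def resolve_finance_llm(judge_cfg: Mapping[str, Any]) -> Optional[str]:
    """Return the finance-specific model, or fall back to the general OpenJudge model."""
    finance_llm = judge_cfg.get("finance_llm")
    # Treat a missing key, None, or an empty/whitespace string as "not configured".
    if finance_llm is None or not str(finance_llm).strip():
        return judge_cfg.get("openjudge_llm")
    return str(finance_llm).strip()


# Model names below are placeholders, not values taken from the PR.
with_override = resolve_finance_llm({"openjudge_llm": "general-model", "finance_llm": "finance-model"})
without_override = resolve_finance_llm({"openjudge_llm": "general-model", "finance_llm": ""})
```

Treating an empty string as unset matters for templated configs, where a placeholder that is left blank would otherwise be passed through as an invalid model name.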
Code Review
This pull request refactors the reward calculation and evaluation framework for the Finance Deep Research Agent, transitioning from the RM Gallery implementation to a more flexible OpenJudge-based FinanceCompositionEvaluator. It also updates the training infrastructure, configuration templates, and documentation to support this new evaluation approach. My feedback focuses on improving the robustness of the training scripts by removing hardcoded paths in favor of environment variables, fixing documentation errors, and cleaning up unused configuration templates.
export TRAIN_DATA_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/train_merged_all.json"
export TRAIN_REF_ANS_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/Reference_merged_all.json"
The script hardcodes user-specific paths for TRAIN_DATA_PATH and TRAIN_REF_ANS_PATH. These paths will not work on other machines. The script already sources the .env file, which is the correct place for these configurations. Please remove these export lines to allow the values from the .env file to be used.
export RAY_CLUSTER_MODE="multi_node"
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}" # AgentJet may use this path internally
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}"
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
The DEEPFINANCE_SCRIPT variable contains a hardcoded, user-specific path to conda.sh (/mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh). This will fail on any other developer's machine. The .env_sample file already defines a CONDA_PATH variable for this purpose. Please use that variable here.
Suggested change:
- export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
+ export DEEPFINANCE_SCRIPT="source ${CONDA_PATH} && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
export TRAIN_DATA_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/train_merged_all.json"
export TRAIN_REF_ANS_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/Reference_merged_all.json"
The script hardcodes user-specific paths for TRAIN_DATA_PATH and TRAIN_REF_ANS_PATH. These paths will not work on other machines. The script already sources the .env file, which is the correct place for these configurations. Please remove these export lines to allow the values from the .env file to be used.
export PYTHONPATH="${AJET_ROOT}:${OPENJUDGE_ROOT}:${PYTHONPATH}"
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}"
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
The DEEPFINANCE_SCRIPT variable contains a hardcoded, user-specific path to conda.sh (/mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh). This will fail on any other developer's machine. The .env_sample file already defines a CONDA_PATH variable for this purpose. Please use that variable here.
Suggested change:
- export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
+ export DEEPFINANCE_SCRIPT="source ${CONDA_PATH} && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
| | **model** | **finance** | **others** | **overall** | | | | | | | | | | | | | | ||
| | ------------------------------- | ----------------- | ---------- | --------------------- | ----------- | ----------------- | ----------------- | ------- | --------------------- | ----------- | ----------------- | ----------------- | ------- | --------------------- | ----------- | ----------------- | | ||
| | | comprehensiveness | insight | instruction_following | readability | **overall_score** | comprehensiveness | insight | instruction_following | readability | **overall_score** | comprehensiveness | insight | instruction_following | readability | **overall_score** | | ||
| | **Qwen3-30B-A3B-Instruct-2507** | 0.181 | 0.169 | 0.191 | 0.211 | 0.184 | 0.112 | 0.111 | 0.117 | 0.137 | 0.118 | 0.122 | 0.119 | 0.128 | 0.148 | 0.127 | | ||
| | **Tongyi DeepResearch** | 0.291 | 0.282 | 0.316 | 0.313 | 0.296 | 0.270 | 0.260 | 0.289 | 0.290 | 0.274 | 0.273 | 0.263 | 0.293 | 0.293 | 0.277 | | ||
| | **Claude 3.7** | 0.404 | 0.398 | 0.465 | 0.416 | 0.417 | 0.412 | 0.406 | 0.462 | 0.417 | 0.423 | 0.411 | 0.405 | 0.462 | 0.417 | 0.422 | | ||
| | **Ours** | 0.476 | 0.472 | 0.488 | 0.487 | 0.479 | 0.470 | 0.470 | 0.485 | 0.484 | 0.475 | 0.471 | 0.471 | 0.485 | 0.484 | **0.476** | |
The current markdown table is very wide and difficult to read due to the attempt to simulate colspan for headers. This is not standard in markdown and may render poorly in some viewers. For better readability and correctness, I suggest restructuring the table into a 'long' format.
Here is an example of a more conventional and readable structure:
| Model | Category | Comprehensiveness | Insight | Instruction Following | Readability | Overall Score |
| ------------------------------- | -------- | ----------------- | ------- | --------------------- | ----------- | ------------- |
| **Qwen3-30B-A3B-Instruct-2507** | finance | 0.181 | 0.169 | 0.191 | 0.211 | 0.184 |
| | others | 0.112 | 0.111 | 0.117 | 0.137 | 0.118 |
| | overall | 0.122 | 0.119 | 0.128 | 0.148 | 0.127 |
| **Tongyi DeepResearch** | finance | 0.291 | 0.282 | 0.316 | 0.313 | 0.296 |
| | others | 0.270 | 0.260 | 0.289 | 0.290 | 0.274 |
| | overall | 0.273 | 0.263 | 0.293 | 0.293 | 0.277 |
| **Claude 3.7** | finance | 0.404 | 0.398 | 0.465 | 0.416 | 0.417 |
| | others | 0.412 | 0.406 | 0.462 | 0.417 | 0.423 |
| | overall | 0.411 | 0.405 | 0.462 | 0.417 | 0.422 |
| **Ours** | finance | 0.476 | 0.472 | 0.488 | 0.487 | 0.479 |
| | others | 0.470 | 0.470 | 0.485 | 0.484 | 0.475 |
| | overall | 0.471 | 0.471 | 0.485 | 0.484 | **0.476** |
1. Xie, Q., et al. (2024). *FinBen: A Holistic Financial Benchmark for Large Language Models*. arXiv:2402.12659.
2. Du, M., et al. (2025). *DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents*. arXiv:2506.11763.
3. FInance Tool API:[https://basic.10jqka.com.cn/](https://basic.10jqka.com.cn/601899/equity.html#stockpage)
cd /path/to/AgentJet
bash install.sh # TODO: reduce this part to a single install: https://yuque.alibaba-inc.com/bayotg/wxz7sb/qdesuu33621x2yhi
# Use uv to install ajet
git clone -b dev/shuchang_newjudge https://github.com/modelscope/AgentJet.git
The tutorial instructs users to clone a specific development branch (dev/shuchang_newjudge). This is not ideal for documentation, as development branches can be temporary, rebased, or deleted, which would break the instructions for future users. It's better to point to the main branch or a stable release tag.
Suggested change:
- git clone -b dev/shuchang_newjudge https://github.com/modelscope/AgentJet.git
+ git clone https://github.com/modelscope/AgentJet.git
| `EBTU_WEIGHT` | 0.0 | Evidence traceability weight (optional) |
| `AUDIT_WEIGHT` | 0.0 | Reference logic audit weight (optional) |
```bash
bash AgentJet/tutorial/example_deep_finance/deep_finance.sh
```
The path in this command is incorrect. The preceding instructions have the user cd into the AgentJet directory. Therefore, the AgentJet/ prefix in the path is redundant and will cause the command to fail.
Suggested change:
- bash AgentJet/tutorial/example_deep_finance/deep_finance.sh
+ bash tutorial/example_deep_finance/deep_finance.sh
# ------------------ OpenJudge Finance configuration ------------------
# Note: Finance evaluation now uses the OpenJudge FinanceCompositionEvaluator
# finance_llm can be configured separately; if unset, openjudge_llm is reused
ajet:
  project_name: "{{PREFIX}}"
  experiment_name: "{{SUFFIX}}"
  # Judge configuration (nested structure, corresponding to self.config.ajet.judge.*)
  judge:
    openjudge_llm: {{OPENJUDGE_LLM}} # OpenJudge model (used for general evaluation)
    finance_llm: {{FINANCE_LLM}} # Model dedicated to Finance evaluation (optional; leave empty to reuse openjudge_llm)
    concurrency: {{JUDGE_CONCURRENCY}} # Judge concurrency
    train_ref_ans_path: {{TRAIN_REF_ANS_PATH}} # Path to the training-set reference answers
    val_ref_ans_path: {{VAL_REF_ANS_PATH}} # Path to the validation-set reference answers
    # Weight configuration
    # rm_weight: Finance evaluation weight (uses FinanceCompositionEvaluator; supports stock_analysis/industry/macro/event/search)
    rm_weight: {{RM_WEIGHT}}
    presentation_quality_weight: {{PRESENTATION_QUALITY_WEIGHT}} # Report presentation quality evaluation
    grounding_weight: {{GROUNDING_WEIGHT}} # Citation compliance evaluation
    cgcv_weight: {{CGCV_WEIGHT}} # Citation-Grounded Claim Verification
    audit_weight: {{AUDIT_WEIGHT}} # Reference logic audit
    traceability_weight: {{TRACEABILITY_WEIGHT}} # Traceability/verifiability audit (TVR)
    ebtu_weight: {{EBTU_WEIGHT}} # EBTU evidence-first traceability audit
This YAML template file appears to be unused. The training scripts (deep_finance.sh and deep_finance_single.sh) use deepfinance_template.yaml instead. Furthermore, this file contains placeholders like {{CGCV_WEIGHT}}, {{TRACEABILITY_WEIGHT}}, and {{EBTU_WEIGHT}} which are no longer defined or substituted in the shell scripts. If this file were to be used, it would cause a configuration parsing error. To avoid confusion and prevent future errors, it's best to remove this file from the repository.
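One way to catch the problem described in this comment before a config is parsed is to scan the template for `{{NAME}}` placeholders that no script supplies. The following Python sketch is only an illustration: the helper name `unresolved_placeholders` and the idea of passing the set of provided variable names are assumptions, and nothing in the PR indicates that such a check exists in the repository.

```python
import re
from typing import Iterable, List

# Matches placeholders of the form {{NAME}} as used in the template above.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_]+)\s*\}\}")


def unresolved_placeholders(template_text: str, provided: Iterable[str]) -> List[str]:
    """Return placeholder names present in the template but missing from `provided`."""
    found = {match.group(1) for match in PLACEHOLDER.finditer(template_text)}
    return sorted(found - set(provided))


# Small excerpt in the style of the template; the provided names are illustrative.
template = (
    "rm_weight: {{RM_WEIGHT}}\n"
    "cgcv_weight: {{CGCV_WEIGHT}}\n"
    "ebtu_weight: {{EBTU_WEIGHT}}\n"
)
missing = unresolved_placeholders(template, provided={"RM_WEIGHT"})
```

A non-empty result could be turned into a hard failure in the launch script, so that a stale template is reported by name instead of surfacing later as a YAML parsing error.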
Description
This Pull Request introduces the FinWorldJudgeByOpenJudge protocol to enhance the automated evaluation capabilities of AgentJet in financial scenarios. Additionally, it includes comprehensive documentation updates, including bilingual blogs and an improved README to better guide users and contributors.
Key Changes
1. Core Logic & Evaluation
- … `FinWorldJudgeByOpenJudge`, leveraging the `openjudge` framework to provide more nuanced and reliable scoring for financial agent tasks.
- … `openjudge` to the project requirements to support the new evaluation backend.
2. Documentation & Community
- … `README.md` with clearer setup instructions.
3. Git Maintenance
- … `dev/shuchang_newjudge` and the `main` branch to ensure a clean merge.
Type of Change
- … `FinWorldJudgeByOpenJudge`.